We can extend the simple linear regression model to accommodate multiple predictors. This gives the multiple linear regression model, of which simple linear regression is just the particular case with a single predictor variable.
In the next sections we will show how to fit multiple regression models with the least squares method in R, and how to interpret the model coefficients.
In an additive multiple linear regression, the expected value of the response \(\mu_i\) is a weighted sum of the predictor variables \(x_{ij}\), each one weighted by its own coefficient \(\beta_j\), plus an intercept \(\beta_0\):
\[ \begin{align} y_i &\sim Normal(\mu_i, \sigma) \\ \mu_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_j x_{ij} \\ \sigma & = C \end{align} \]
The expression for the predicted value in this model can be written in the compact form:
\[\mu_i = \sum_{j=0}^{J} \beta_{j} x_{ij}\]
Where \(J\) is the number of predictor variables and \(x_{i0} \equiv 1\), so that the \(j = 0\) term is the intercept \(\beta_0\). This sum is called the linear predictor.
We will use a data set from an experiment with the biennial plant Ipomopsis aggregata. Forty plants were assigned at random to two treatments: unprotected from grazing by rabbits, or protected from grazing by fences. The fruit yield of each plant (dry mass in mg) was measured at the end of the experiment. To take into account the effect of plant size on fruit production, the diameter of the top of the rootstock (mm) was also recorded. Copy the following commands and paste them into your R console to read and prepare the data for the analyses:
## Data reading and preparation
ipomopsis <- read.csv("https://raw.githubusercontent.com/lmmx/crawley-r-statistics-book-notes/master/code/ipomopsis.csv")
## More proper names
names(ipomopsis) <- c("diameter","fruit.w","treatment")
## Convert treatment to a factor variable
ipomopsis$treatment <- factor(ipomopsis$treatment,
labels = c("Control", "Fenced"))
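A quick sanity check of the prepared data is worthwhile before plotting or fitting (an optional step, not part of the original recipe):
## Inspect variable types and ranges
str(ipomopsis)
summary(ipomopsis)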
Let’s start with a plot of the response variable as a function of both predictors:
plot(fruit.w ~ diameter, data = ipomopsis, type ="n",
ylab = "Fruit dry weight (mg)",
xlab = "Rootstock diameter (mm)")
points(fruit.w ~ diameter, data = ipomopsis,
subset = treatment == "Control")
points(fruit.w ~ diameter, data = ipomopsis,
subset = treatment == "Fenced", pch=2, col ="red")
legend("topleft", c("Control", "Fenced"), pch=c(1,2),
col =c("black", "red"), bty="n")
Larger plants tend to produce a larger dry biomass of fruits. It also seems that, if we control for size, fenced plants had larger fruit yields. We can check this hypothesis by fitting a linear regression with the additive effects of plant size and treatment:
ipo_m1 <- lm(fruit.w ~ diameter + treatment, data = ipomopsis)
Use the function coef to store the coefficients (\(\beta_j\)) of this fit in an object:
(ipo_cf1 <- coef(ipo_m1))
(Intercept) diameter treatmentFenced
-127.82936 23.56005 36.10325
The predictor variable “treatment” is a factor with two levels (control/fenced). The default in R is to create a dummy (indicator) variable for each level of the factor except the first, the reference level. This means that the model has a variable “treatmentFenced” that has value one for fenced plants, and value zero otherwise. So, the fruit yield predicted (\(\widehat{Y}\)) by this model for control plants is:
\[\widehat{Y}_{\text{control}} = \beta_0 + \beta_1 \text{diameter} + \beta_2 \times 0\]
While for fenced plants we have:
\[\widehat{Y}_{\text{fenced}} = \beta_0 + \beta_1 \text{diameter} + \beta_2 \times 1\]
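To see the dummy-variable arithmetic in action, we can compare the two predictions at a hypothetical rootstock diameter of 10 mm (an arbitrary value, just for illustration):
## Predicted fruit yield for a 10 mm rootstock under both treatments
new_data <- data.frame(diameter = c(10, 10),
                       treatment = c("Control", "Fenced"))
predict(ipo_m1, newdata = new_data)
## The same by hand:
## ipo_cf1[1] + ipo_cf1[2] * 10               # Control
## ipo_cf1[1] + ipo_cf1[2] * 10 + ipo_cf1[3]  # Fenced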
To add the line of expected yield for fenced plants we can use the function abline, which draws a straight line with a given intercept and slope:
## Make the plot again
plot(fruit.w ~ diameter, data = ipomopsis, type ="n",
ylab = "Fruit dry weight (mg)",
xlab = "Rootstock diameter (mm)")
points(fruit.w ~ diameter, data = ipomopsis,
subset = treatment == "Control")
points(fruit.w ~ diameter, data = ipomopsis,
subset = treatment == "Fenced", pch=2, col ="red")
legend("topleft", c("Control", "Fenced"), pch=c(1,2),
col =c("black", "red"), bty="n")
## Adds the predicted line for fenced plants
## (intercept beta_0 + beta_2, slope beta_1); check the function help
abline(a = ipo_cf1[1] + ipo_cf1[3], b = ipo_cf1[2], col = "red")
Challenge: now add to the plot a black line for the expected yield of non-fenced (control) plants. Your plot should look like this:
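One possible way to draw that line (a sketch; try it yourself before peeking): for control plants the dummy variable is zero, so the line has intercept \(\beta_0\) and slope \(\beta_1\).
## Predicted line for control (non-fenced) plants
abline(a = ipo_cf1[1], b = ipo_cf1[2], col = "black")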
From the plot above, we can see that the difference between the yields of plants of the same size is a constant, which corresponds to the coefficient of the variable “treatmentFenced”. Thus, the fitted model estimates that fenced plants had on average 36 mg more fruit yield, if we control for plant size.
Final Challenge
What if we had not controlled for plant size? Fit a model only with the treatment and interpret the coefficients.
Hint: it is good practice to do exploratory data analyses before any model fitting. In this case a box plot can help; try:
boxplot(fruit.w ~ treatment, data=ipomopsis)
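If you want to check your answer afterwards, here is a minimal sketch of the treatment-only fit (the object name ipo_m0 is just a suggestion):
## Model with treatment only, ignoring plant size
ipo_m0 <- lm(fruit.w ~ treatment, data = ipomopsis)
coef(ipo_m0)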
Before making inferences with your fitted model, it is also good practice to run some model diagnostics. For linear regression, the most basic diagnostic is to check whether the residuals follow a normal distribution with zero mean and a constant standard deviation.
Because the normal distribution has a parameter that corresponds to the expected value, independent of the other parameter, the linear regression model we have been writing as:
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \sum_{j=0}^{J} \beta_{j} x_{ij} \\ \sigma & = C \end{align} \]
Can be also written as:
\[ \begin{align} y_i & = \sum_{j=0}^{J} \beta_j x_{ij} + \epsilon_i \\ \epsilon_i & \sim Normal(\mu, \sigma) \\ \mu & = 0 \\ \sigma & = C \end{align} \]
That is, the values of the response \(y_i\) are each a weighted sum of the predictor variables \(x_{ij}\) plus a random Gaussian residual \(\epsilon_i\). To express the random variation symmetrically around the expected value, this residual has a mean of zero and a fixed standard deviation.
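Because the model includes an intercept, the least-squares residuals average exactly zero; here is a quick numerical check on the model fitted above:
## Residuals should scatter around zero with roughly constant spread
mean(residuals(ipo_m1))  # essentially zero by construction
sd(residuals(ipo_m1))    # estimate of the residual standard deviation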
To check whether this assumption holds, we plot the residuals of the fitted model as a function of the predicted values. This plot should show a point cloud of constant width around zero on the Y-axis. You can check this assumption by applying the function plot to the object that stores the fitted model:
plot(ipo_m1, which =1)
You can also check the normality of the residuals with a qq-plot. Normal data should lie on a straight line in this plot:
plot(ipo_m1, which =2)
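As a formal complement to the visual check (a suggestion beyond the plots above), you can apply a Shapiro-Wilk test to the residuals; the visual diagnostics are usually more informative, though:
## Formal normality test of the residuals
shapiro.test(residuals(ipo_m1))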
You can find an excellent explanation of these and other diagnostic plots for linear regression here.
In the next example, beds of tulips grown in greenhouses are subjected to different soil moisture and light conditions. The response variable is the size of the blooms under these varying conditions.
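The tulips data set ships with the rethinking R package (an assumption about where your copy lives; adjust this step to however you obtained the data):
## Load the tulips data from the 'rethinking' package and inspect it
data(tulips, package = "rethinking")
str(tulips)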
'data.frame': 27 obs. of 4 variables:
$ bed : Factor w/ 3 levels "a","b","c": 1 1 1 1 1 1 1 1 1 2 ...
$ water : int 1 1 1 2 2 2 3 3 3 1 ...
$ shade : int 1 2 3 1 2 3 1 2 3 1 ...
$ blooms: num 0 0 111 183.5 59.2 ...
The variable water indicates one of three ordered levels of soil
moisture, from low (1) to high (3). The variable shade indicates one
of three ordered levels of light exposure, from high (1) to low (3).
Both predictor factors are crucial for plant growth. In simple
regression settings, we would expect a positive response of bloom size
to increases in soil moisture and light (in this case a negative slope
for shade).
However, how these two variables interact to affect plant growth is also interesting. In the absence of light, higher soil moisture does little for photosynthesis, which requires both. Likewise, in the absence of water, more light does not contribute to plant growth.
The mutual dependency of how predictor variables affect the response can be modeled as an interaction.
The model with interaction has the following notation:
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \beta_0 \, + \, \beta_1 x_{i1} \, + \, \beta_2 x_{i2} \, + \, \beta_3 x_{i1} x_{i2}\\ \sigma & = C \end{align} \]
Where:
\(y_i\) is each value of the response variable blooms
\(x_{i1}\) is each value of the predictor variable water
\(x_{i2}\) is each value of the predictor variable shade
\(x_{i1} \cdot x_{i2}\) is the product of the two explanatory variables, or the interaction term.
\(\beta_j\) are the model coefficients.
Thus, the model with interaction adds a coefficient \(\beta_3\) to account for the interaction between the levels of watering and shading.
Let’s plot the data:
color <- c("black", "red", "blue")
plot(blooms ~ water,
data = tulips,
type = "n",
ylab = "Bloom numbers",
xlab = "Water level")
for (s in 1:3) {
points(blooms ~ water,
data = tulips,
subset = shade == s, pch = s, col = color[s])
}
legend("topleft",
c("Shade 1", "Shade 2", "Shade 3"),
pch = c(1, 2, 3),
col =c("black", "red", "blue"), bty = "n")
Let’s fit the model with the interaction between the effects of shade and water, and extract its coefficients. The R syntax to include an interaction in a model formula is “variable1:variable2”.
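A sketch of this step (the coefficient object tul_cf2 matches the code that follows; tul_m2 is just a suggested name for the model object):
## Fit the model with the interaction term and store the coefficients
tul_m2 <- lm(blooms ~ water + shade + water:shade, data = tulips)
(tul_cf2 <- coef(tul_m2))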
Unlike in the additive model, the slopes of a regression model with an interaction term can vary between groups.
In this example, the interacting (or moderator) variable shade has only three values (1, 2, 3). We could treat both variables as factors with three levels, but for learning purposes we will use them here as continuous values. Thus, the predicted value at the lowest level of water (water = 1), for any value of shade, is:
\[\begin{align} \widehat{Y}_{[\text{water=1}]} & = \beta_0 \; + \; \beta_1 \times 1 \; + \; \beta_2 \times \text{shade} \; + \; \beta_3 \times 1 \times \text{shade} \\ & = (\beta_0 + \beta_1) \; + \; (\beta_2 + \beta_3) \times \text{shade} \end{align} \]
That is, when the value of water is one, the intercept of the linear predictor is increased by \(\beta_1\), and the slope is increased by \(\beta_3\).
The predicted values for water = 2 are:
\[\begin{align} \widehat{Y}_{[\text{water=2}]} & = \beta_0 \; + \; \beta_1 \times 2 \; + \; \beta_2 \times \text{shade} \; + \; \beta_3 \times 2 \times \text{shade} \\ & = (\beta_0 + 2 \beta_1) \; + \; (\beta_2 + 2 \beta_3) \times \text{shade} \end{align} \]
And so on. Therefore, in a model with interaction, the effect of one predictor variable (that is, how much an increase in it changes the response) depends on the value of some other predictor variable. In this example, the effect of increasing or decreasing shading depends on how moist the plant beds are.
Let’s calculate conditional slopes for the different values of shade (1, 2, and 3). A handy way to do that is:
s <- 1:3 #values of shade
intercepts <- tul_cf2[1] + tul_cf2[3] * s
slopes <- tul_cf2[2] + tul_cf2[4] * s
And now we can plot again the dataset and add lines according to these values.
color <- c("black", "red", "blue")
plot(blooms ~ water,
data = tulips,
type = "n",
ylab = "Bloom numbers",
xlab = "Water level")
for (s in 1:3) {
points(blooms ~ water,
data = tulips,
subset = shade == s, pch = s, col = color[s])
abline(a = intercepts[s], b = slopes[s], col = color[s])
}
legend("topleft",
c("Shade 1", "Shade 2", "Shade 3"),
pch = c(1, 2, 3),
col = c("black", "red", "blue"), bty = "n")
As you can see, the effect of more soil moisture is steepest when there is more light, and it weakens with increasing values of shade.
Extra:
Run the diagnostic plots for this model
Fit and compare the model with only additive effects for this dataset (a sketch follows the questions below):
\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}\\ \sigma & = C \end{align} \]
What happens with the coefficients? What happens with the predicted values?
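A minimal sketch for these extra items, assuming the interaction model object is named tul_m2 as above:
## Diagnostic plots for the interaction model
plot(tul_m2, which = 1)
plot(tul_m2, which = 2)
## Additive-effects model for comparison
tul_m1 <- lm(blooms ~ water + shade, data = tulips)
coef(tul_m1)
coef(tul_m2)
## Side-by-side predicted values
head(cbind(additive = predict(tul_m1), interaction = predict(tul_m2)))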